
In recent years, Airbnb has revolutionized the lodging industry. Consumers are no longer limited to expensive hotels and instead have access to a diverse set of options in location, pricing, and amenities. We explore which factors are important in the ranking of so-called "Superhosts," who may receive preferential bookings from customers. Further, we create a model to predict the average rating of a listing based on auxiliary information. Understanding these factors could help Airbnb hosts optimize their listings.
Meanwhile, an entire industry has sprung up to meet the demand for short-term rentals. Some hosts use Airbnb to subsidize part-time living: they simply rent out their home in the summer or winter months, or while they are out of town for an extended period of time. Others may rent out a spare unit in a large apartment to generate additional revenue. By contrast, some hosts own multiple properties for the sole purpose of leasing them on Airbnb. We explore whether machine learning algorithms can classify these types of hosts based on their rentals, which could help regulators understand this new and complicated industry.
This page acts as both an investigation and a tutorial. We will walk through the various steps of the data science pipeline. There are hundreds of articles about machine learning on Airbnb data, but few walk you through the process from start to finish. This is what we aim to accomplish here.
Inside Airbnb provides an extensive public dataset of Airbnb listings. In this project, we decided to focus on listings in New York City's five boroughs. For easy code reproduction, we included a permalink to the dataset, which we download directly through pandas. We show one representative listing from the dataset.
import pandas as pd
import numpy as np
from plotnine import *
df = pd.read_csv('http://data.insideairbnb.com/united-states/ny/new-york-city/2020-04-08/data/listings.csv.gz')
df.iloc[3654:3655, :]
Looking at the above dataframe, we see that it contains over a hundred columns, many of which contain irrelevant or duplicate information and are not directly suitable for data science. In the sections below, we identify a subset of useful columns and perform the necessary preprocessing to coerce these columns into the correct format.
Further, as seen above, many of these columns contain null values. Given the massive size of the Airbnb public dataset, we need a robust method to evaluate which features of the dataset have sufficient information to explore. Let's compute the percentage of entries in each feature column that are non-null.
A thorough guide on data preprocessing can be found below. https://towardsdatascience.com/data-cleaning-series-with-python-part-1-24bb603c82c8?gi=76b4899af990
features_to_evaluate = ['square_feet', 'transit', 'host_response_time', 'is_business_travel_ready']

def evaluateList(df, features_to_evaluate):
    """Print the percentage of non-null entries for each listed feature."""
    total_entities = df.shape[0]
    print("Percentage of Valid Entities For Each Feature")
    for feature in features_to_evaluate:
        # notnull() marks valid entries; summing the booleans counts them
        non_null_count = df[feature].notnull().sum()
        print(f"Total of {round(100 * non_null_count / total_entities, 1)}% for {feature}")

evaluateList(df, features_to_evaluate)
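The same coverage figures can also be computed without an explicit loop. The following is a minimal sketch, assuming a small made-up frame `toy` in place of the real listings data; the helper name `non_null_pct` is hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def non_null_pct(frame, columns):
    """Return the percentage of non-null entries for each requested column."""
    # notnull() gives a boolean frame; the mean of booleans is the fraction of True values
    return (frame[columns].notnull().mean() * 100).round(1)

# Toy stand-in for the listings data (values are made up for illustration)
toy = pd.DataFrame({
    "square_feet": [np.nan, np.nan, np.nan, 500.0],
    "transit": ["Subway nearby", None, "Bus", "Ferry"],
})

coverage = non_null_pct(toy, ["square_feet", "transit"])
print(coverage)
```

Returning a Series rather than printing makes it easy to sort the result or to filter for columns above a chosen coverage threshold.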
Based on these percentages, let's avoid using square_feet; the other columns seem to contain sufficient information.
import math
info_about_listing = ['beds', 'is_location_exact', 'property_type', 'price', 'availability_365', 'is_business_travel_ready', 'review_scores_rating','room_type']
host_info = ['host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_total_listings_count', 'host_is_superhost', 'host_id']
# Grab Subset of Columns
bnb_df = df[info_about_listing + host_info].copy()
# Convert df to correct types ('t' becomes True, anything else becomes False)
bnb_df['host_is_superhost'] = bnb_df['host_is_superhost'] == "t"
bnb_df['is_location_exact'] = bnb_df['is_location_exact'] == "t"
bnb_df['price'] = pd.to_numeric(bnb_df['price'].apply(lambda x: x.lstrip("$").replace(".00", "").replace(",","")))
# Subset the apartment types: keep the listed types and group the rest as "Other"
apartment_types = ['Apartment', "Guest suite", "Townhouse", "Hotel"]

def simplifyApartmentType(property_type):
    if property_type in apartment_types:
        return property_type
    else:
        return "Other"

bnb_df['property_type'] = bnb_df['property_type'].apply(simplifyApartmentType)
# Converts a percent string such as "95%" to a float, keeps NaN
def change_to_percent(x):
    if isinstance(x, float) and math.isnan(x):
        return float('nan')
    else:
        # Drop the trailing '%' before parsing
        return float(x[:-1])
# Converts response and acceptance to percent
bnb_df['host_response_rate'] = bnb_df['host_response_rate'].apply(change_to_percent)
bnb_df['host_acceptance_rate'] = bnb_df['host_acceptance_rate'].apply(change_to_percent)
bnb_df.head()
bnb_df.query("price > 10000")
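The string-to-number conversions above can also be written with pandas' vectorized string methods. This is a minimal sketch under stated assumptions: the frame `raw` and its values are made up for illustration, and it is an alternative formulation rather than the code used in this walkthrough.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the raw string columns (values are made up for illustration)
raw = pd.DataFrame({
    "price": ["$1,250.00", "$85.00", "$99.50"],
    "host_response_rate": ["100%", np.nan, "87%"],
})

# Strip the dollar sign and thousands separators, then parse as a number
raw["price"] = pd.to_numeric(raw["price"].str.replace(r"[$,]", "", regex=True))

# Drop the trailing percent sign; missing values stay NaN
raw["host_response_rate"] = pd.to_numeric(raw["host_response_rate"].str.rstrip("%"))

print(raw)
```

Because the `.str` accessor passes missing values through unchanged, no explicit NaN check is needed in this version.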
To predict a listing's rating, we first need to expand the bnb_df that we built earlier to include a few more features.
rating_pred = bnb_df.copy(deep=True)
rating_pred[['neighbourhood', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type','guests_included']] = df[['neighbourhood', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'guests_included']]
rating_pred = rating_pred.drop(columns=['is_location_exact', 'host_response_time'])
Next, we need to prepare this data for modeling. We are going to use a DecisionTreeClassifier, which requires all variables to be numeric. Thus, we first convert the boolean columns to 1 or 0.
# Convert 't'/'f' to 1/0 respectively
def change_char_to_int(x):
    if x == 't':
        return 1
    else:
        return 0

# Convert True/False to 1/0 respectively
def change_bool_to_int(x):
    if x:
        return 1
    else:
        return 0

rating_pred['is_business_travel_ready'] = rating_pred['is_business_travel_ready'].apply(change_char_to_int)
rating_pred['host_is_superhost'] = rating_pred['host_is_superhost'].apply(change_bool_to_int)
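For reference, the same two conversions can be expressed with column-level operations instead of per-element helper functions. This is a minimal sketch, assuming a made-up frame `flags` that mimics the two kinds of flag columns; it is an alternative, not the notebook's own code.

```python
import pandas as pd

# Toy stand-ins for the two kinds of flag columns (values are made up)
flags = pd.DataFrame({
    "is_business_travel_ready": ["t", "f", "f"],   # raw 't'/'f' strings
    "host_is_superhost": [True, False, True],      # already converted to bool
})

# Compare against 't' to get booleans, then cast the booleans to 0/1 integers
flags["is_business_travel_ready"] = flags["is_business_travel_ready"].eq("t").astype(int)
flags["host_is_superhost"] = flags["host_is_superhost"].astype(int)

print(flags)
```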
Finally, we need to convert the categorical variables into numerical ones. To do this, we use the pandas function get_dummies. It replaces a categorical column with one indicator column per possible value and sets a 1 in the column that matches the row's actual value. For example, bed_type can be Regular, Futon, Couch, and so on. get_dummies then adds new columns bed_type_Regular, bed_type_Futon, etc. and sets the proper column to 1.
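As a small illustration of the encoding just described, here is a sketch on a made-up three-row frame. The frame `toy` and its values are assumptions for illustration only; `dtype=int` is passed so the indicator columns hold 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (values are made up)
toy = pd.DataFrame({"beds": [1, 2, 1], "bed_type": ["Futon", "Couch", "Futon"]})

# Replace bed_type with one 0/1 indicator column per distinct value
encoded = pd.get_dummies(toy, columns=["bed_type"], dtype=int)

print(list(encoded.columns))
print(encoded)
```

The original bed_type column is dropped, and each new column is named with the original column name, an underscore, and the category value.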
# Encode the categorical columns as numeric
rating_pred = pd.get_dummies(rating_pred, columns=['property_type', 'room_type', 'neighbourhood', 'bed_type'])
# Remove rows containing NaN, as they generally correspond to premature (very new) listings
rating_pred = rating_pred.dropna()
rating_pred
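With every column numeric, a table like this can be passed to scikit-learn. The sketch below shows one possible way to fit a DecisionTreeClassifier, under loudly stated assumptions: the frame `toy` is made up, and the `high_rating` label with its cutoff of 90 is a hypothetical binning introduced here because a classifier needs discrete classes. It is not necessarily the setup used in the rest of this walkthrough.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy, already-numeric feature table (values are made up for illustration)
toy = pd.DataFrame({
    "price": [50, 60, 200, 220, 55, 210],
    "host_is_superhost": [0, 0, 1, 1, 0, 1],
    "review_scores_rating": [80, 82, 97, 99, 78, 96],
})

# Assumption: bin the score into two classes so it can serve as a class label
toy["high_rating"] = (toy["review_scores_rating"] >= 90).astype(int)

X = toy[["price", "host_is_superhost"]]
y = toy["high_rating"]

# A shallow tree keeps the sketch easy to inspect; random_state makes it repeatable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Predictions on the training rows, only to show the call pattern
predictions = tree.predict(X)
print(predictions)
```

On real data, the rows would be split into training and held-out sets before fitting, since accuracy on the training rows says little about how well the tree generalizes.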
(ggplot(bnb_df, aes(x='review_scores_rating'))
    + geom_histogram(bins=20)
    + labs(title="Review Scores of Property Listings",
           x="Review Score",
           y="Count"))
The above histogram shows the distribution of review scores for all listings in the NYC dataset. As you can see, most listings have a high overall rating, with a tail of lower scores. This dataset also includes removed listings, which may help reduce the survivorship bias that would otherwise push averages higher.
# df[['host_id', 'host_total_listings_count']]
unique_hosts = df[['host_id', 'host_total_listings_count']].groupby('host_id').agg('mean')
(ggplot(unique_hosts.query("host_total_listings_count < 10"), aes(x='host_total_listings_count'))
    + geom_histogram(bins=10)
    + labs(title="Number of Listings For Hosts",
           x="Number of Listings",
           y="Number of Hosts"))
The above histogram shows the total number of listings per unique host. A majority of hosts have just a single listing, but a non-negligible number of hosts own significantly more. Can we identify these differences in a systematic way?
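One simple starting point for quantifying this split is to count listings per host and compare the share of single-listing hosts with the share of multi-listing hosts. This is a minimal sketch on a made-up frame `listings`; note that counting rows per host_id may differ from the host_total_listings_count column, which is reported per host rather than derived from the rows present in the data.

```python
import pandas as pd

# Toy stand-in: one row per listing, identified by its host (values are made up)
listings = pd.DataFrame({"host_id": [1, 2, 2, 3, 4, 4, 4, 5]})

# Count the listings that appear in the data for each host
listings_per_host = listings.groupby("host_id").size()

# Share of hosts with exactly one listing versus more than one
single_share = (listings_per_host == 1).mean()
multi_share = 1 - single_share

print(listings_per_host)
print(f"single-listing hosts: {single_share:.0%}, multi-listing hosts: {multi_share:.0%}")
```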
(ggplot(bnb_df[bnb_df['price'] < 1000].replace({'host_is_superhost': {False: 'Not Superhost', True: 'Superhost'}}), aes(x='review_scores_rating', y='price'))
    + geom_point()
    + facet_grid("property_type~host_is_superhost")
    + geom_smooth(method='lm', color="red")
    + labs(title="Rating Vs Price Conditioned on Host Information",
           x="Review Scores",
           y="Price"))
We can clearly see that Superhosts tend to have higher ratings (not surprising), but is there a deeper relationship between price, ratings, and the Superhost designation that we can uncover?
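A numeric summary can complement the visual impression from the plot. The following is a minimal sketch that compares mean rating and mean price by host type; the frame `toy`, its values, and the `host_type` label column are made up for illustration and do not reflect the actual dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned listings (values are made up for illustration)
toy = pd.DataFrame({
    "host_is_superhost": [True, True, False, False, False],
    "review_scores_rating": [98, 96, 90, 85, 95],
    "price": [120, 150, 80, 200, 100],
})

# Readable group labels (hypothetical helper column)
toy["host_type"] = toy["host_is_superhost"].map({True: "Superhost", False: "Not Superhost"})

# Mean rating and mean price within each group
summary = toy.groupby("host_type")[["review_scores_rating", "price"]].mean()

print(summary)
```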
Here we show the distribution of Airbnb listings in New York City using a heatmap built with ipyleaflet. Most listings appear to be in Manhattan and Brooklyn, while the other boroughs have some listings but at a lower density.
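The data preparation behind such a heatmap, along with a numeric check of the borough split, might look like the sketch below. It assumes the listings table exposes `latitude`, `longitude`, and `neighbourhood_group_cleansed` columns; the frame `toy` and its values are made up, and the map rendering itself is left out.

```python
import pandas as pd

# Toy stand-in for the listings table (coordinates and boroughs are made up)
toy = pd.DataFrame({
    "latitude": [40.78, 40.73, 40.68, 40.69, 40.75],
    "longitude": [-73.97, -73.99, -73.95, -73.98, -73.87],
    "neighbourhood_group_cleansed": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
})

# A list of [latitude, longitude] points, the usual input shape for a heatmap layer
locations = toy[["latitude", "longitude"]].dropna().values.tolist()

# Share of listings in each borough, largest first
borough_share = toy["neighbourhood_group_cleansed"].value_counts(normalize=True)

print(locations[:2])
print(borough_share)
```

The resulting `locations` list is the kind of input that a heatmap layer takes, while `borough_share` gives a quick tabular cross-check of what the map shows.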